feat(db): improve GET /annotate/all performance #2

Open
labradorite-dev wants to merge 21 commits into dev from issue-566-optimize-annotation-load-time

Conversation


labradorite-dev (Owner) commented Mar 8, 2026

Summary

Improves GET /annotate/all performance.

Background Context

Issue Police-Data-Accessibility-Project#566 describes improving the performance of an endpoint. Before optimizing, however, I wanted to measure current performance so the decision would be driven by data; to that end I established a benchmark using pytest-benchmark. To back the benchmark with scale, it seeds 10k URLs of relatively realistic annotation data.

The benchmark determined that main_query_s accounted for by far the largest portion of wall-clock time in this endpoint, so I focused my efforts there. I considered a number of solutions, but by far the most effective was to convert the url_annotation_count_view and url_annotation_flags views from regular PostgreSQL views to materialized views with unique indexes on url_id.
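
As a rough sketch (the view body and column names here are illustrative, not the actual migration), the conversion looks like:

```sql
-- Illustrative sketch only: the real view definitions live in the migration.
DROP VIEW IF EXISTS url_annotation_count_view;

CREATE MATERIALIZED VIEW url_annotation_count_view AS
SELECT url_id, COUNT(*) AS annotation_count  -- assumed shape of the view
FROM url_annotations
GROUP BY url_id;

-- A UNIQUE index is the prerequisite for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX ix_url_annotation_count_view_url_id
    ON url_annotation_count_view (url_id);

-- Non-blocking refresh, as scheduled by the daily refresh operator.
REFRESH MATERIALIZED VIEW CONCURRENTLY url_annotation_count_view;
```

The unique index matters twice over: it lets PostgreSQL refresh without locking out readers, and it gives the planner an index path for the url_id joins in the main query.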

Performance Improvement

Measured locally with pyinstrument via the new *_profiled benchmark tests:

| Test | Before (dev) | After | Improvement |
| --- | --- | --- | --- |
| Small dataset (profiled) | 12.21 ms | 5.99 ms | -51% |
| 10k URLs (scale_profiled) | 41.72 ms | 3.53 ms | -92% |

Staleness Tradeoff

I added the materialized views to the existing RefreshMaterializedViewsOperator schedule, which runs once per day. In the worst case, URL ranking could be up to 24 hours stale - e.g. a URL that just received its 10th annotation may still be ranked highly when it should be deprioritized. My view is that this is acceptable because:

  • No user-facing data is incorrect - only the ordering of work is affected
  • No writes or authorization decisions are gated on these values
  • Annotators are directed to URLs in a slightly suboptimal order, not a wrong one
  • Follows the same trade-off already accepted by the 3 existing materialized views in the codebase

If fresher data is a product priority, I would recommend a more aggressive refresh rate (such as hourly) for the materialized view.

Alternative Long Term Solution

I would be remiss not to include the other, much larger refactor I considered: implementing proper CQRS. The core of the issue is that the data is organized for writes, NOT reads, so at query time we pay the price of stitching all the data together. Materialized views shift that stitching slightly left (to a once-a-day job), but a potentially more "complete" fix would be a new, fully denormalized table that looks like this:

```sql
-- read model, updated on every write command
CREATE TABLE annotation_queue (
    url_id           INT PRIMARY KEY,
    priority_rank    INT,   -- pre-computed: manual > followed > count > id
    total_count      INT,
    is_manual        BOOL,
    followed_by_user BOOL,
    ...
);
```

Then, each time we write a new annotation to be consumed by annotators, we also update this table. At query time, the read becomes an extremely cheap index-backed lookup: SELECT * FROM annotation_queue ORDER BY priority_rank LIMIT 1.
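
Maintaining that table could be a single upsert on the write path. A hedged sketch (the table shape matches the illustration above, and the :params and ranking recomputation are hypothetical, not existing code):

```sql
-- Hypothetical write-path sketch: keep the read model current whenever a
-- new annotation lands. Names and the rank formula are illustrative only.
INSERT INTO annotation_queue (url_id, total_count, priority_rank)
VALUES (:url_id, 1, :computed_rank)
ON CONFLICT (url_id) DO UPDATE
SET total_count   = annotation_queue.total_count + 1,
    priority_rank = :computed_rank;  -- recompute: manual > followed > count > id
```

The upsert keeps reads and writes transactionally consistent, at the cost of a small constant overhead on every annotation write.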

If there is interest, I think this could be filed as a new issue, assuming read performance continues to be a priority for the app.

Test plan

  • All 4 benchmark tests pass (test_benchmark_annotate_all_*)
  • Migration applies cleanly from scratch (alembic upgrade head)
  • CONCURRENTLY refresh works (unique indexes in place as prerequisite)

…-Data-Accessibility-Project#566)

- Add pytest-benchmark~=4.0 dev dependency
- Add ContextVar-based timing collector (timing.py) for zero-cost
  production instrumentation via _phase() context managers
- Instrument extract_and_format_get_annotation_result() with per-phase
  timing: format_s, agency_suggestions_s, location_suggestions_s,
  name_suggestions_s, batch_info_s
- Instrument GetNextURLForAllAnnotationQueryBuilder.run() with
  main_query_s timing
- Add benchmark test suite under tests/automated/integration/benchmark/
  with HTTP round-trip and per-phase breakdown tests
- Add GHA workflow (.github/workflows/benchmark.yml) that runs on
  workflow_dispatch or PR to dev, uploads JSON artifact per commit SHA
- Add README with Mermaid sequence diagram of the measured call chain
…olice-Data-Accessibility-Project#566)

Add scale_seed.py with create_scale_data() that seeds 10k URLs plus
geographic hierarchy, agencies, name/agency/location suggestions to
exercise the query planner under realistic load. Add scale_seeder
fixture (module-scoped) and two new benchmark tests that mirror the
existing small-data pair, printing per-phase averages for direct
comparison.
…olice-Data-Accessibility-Project#566)

The sequential agency_auto_suggestions / add_location_suggestion loops
(4 DB calls × 5k iterations each) blew through the 300s pytest timeout
during fixture setup. Replace with 3 round-trips per suggestion type:
initiate_task → bulk_insert subtasks (return_ids) → bulk_insert suggestions.
…ata-Accessibility-Project#566)

SuggestionModel.robo_confidence is typed int; Pydantic rejects float
values with fractional parts (e.g. 0.9). Use 1.0 which coerces cleanly.
…ags (Police-Data-Accessibility-Project#566)

Convert both annotation views from regular views to MATERIALIZED VIEWs
with unique indexes on url_id. Add CONCURRENTLY refresh calls to
refresh_materialized_views(). Drops main_query_s from ~177ms to ~0.64ms
at 10k-URL scale (~276x improvement).
…test (Police-Data-Accessibility-Project#566)

With materialized views, URLs added between refreshes have no row in the
view. Inner joins excluded them entirely, making new URLs invisible to
annotators. Switch to LEFT OUTER JOINs so URLs without a view row still
appear with NULL/0 counts.

The sort test explicitly verifies annotation-count-driven ordering, so it
needs a manual refresh_materialized_views() call after data setup to
reflect counts before querying.
…ect#566)

- Migration: Union[str, None] -> Optional[str], add docstrings to upgrade/downgrade
- async_.py: add missing newline at end of file
… heads (Police-Data-Accessibility-Project#566)

Upstream dev merged 1fb2286a016c (add_internet_archive_to_batch_strategy)
which also chains from 759ce7d0772b, creating two Alembic heads.
Update our migration's down_revision to sit after upstream's.
…o docker-compose

GitGuardian flagged the literal default password added to docs/development.md.
Replaced with a pointer to local_database/docker-compose.yml where it is already defined.
- docs(benchmark): define 'phase' clearly in README
- ci(benchmark): restrict benchmark workflow to workflow_dispatch only
- fix(migration): restore internet_archive migration to match dev (style-only drift from lint passes)
- fix(annotate): remove isouter from mat view joins; inner join is correct since missing rows are unusable
